Data overview

Many people struggle to get loans due to insufficient or non-existent credit histories, and unfortunately this often affects the most marginalised parts of the population. Given the rather limited nature of the data, this analysis takes a naive approach to understanding consumer defaults. The data set used in this report records consumer default instances via the TARGET variable. A default is defined as a customer who is more than 90 days late on a payment for a given loan, or who fails to repay the loan.

The borrower characteristics include quantitative metrics such as income, credit amount, value of the goods purchased and days employed, as well as qualitative/categorical measures such as gender, education, home ownership, mobile ownership and number of children.

This analysis does not take into account financial inclusion for the unbanked population. In frontier markets, it is important that underserved populations have a positive loan experience and that everyone is given a fair assessment. This data set does not draw on alternative data sources to predict customers' repayment abilities.

Alternative data sources that could be incorporated into the model include:

Import and interrogate data

The feature set includes borrower measures of credit usage, income, annuities and value of goods purchased, as well as qualitative features such as education type, home ownership, date of birth and gender. The response variable is a borrower default (TARGET = 1).

The target variable takes the value 1 if someone experiences payment difficulties or fails to repay a loan.

There are some missing values that need to be imputed or removed from the data, namely 'annuity amount', 'occupation type' and 'goods price amount'. There are also columns that don't contain useful information, such as ID.
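The cleaning steps above can be sketched as follows. This is a minimal illustration on made-up rows; the column names (AMT_ANNUITY, OCCUPATION_TYPE, AMT_GOODS_PRICE, SK_ID_CURR) are assumed stand-ins for the real columns, not confirmed by this report.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the data; values and column names are illustrative.
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0, 29686.5],
    "OCCUPATION_TYPE": ["Laborers", None, "Core staff", "Laborers"],
    "AMT_GOODS_PRICE": [351000.0, 1129500.0, np.nan, 297000.0],
})

# Drop the identifier column -- it carries no predictive signal.
df = df.drop(columns=["SK_ID_CURR"])

# Impute numeric gaps with the median, categorical gaps with the mode.
for col in ["AMT_ANNUITY", "AMT_GOODS_PRICE"]:
    df[col] = df[col].fillna(df[col].median())
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna(df["OCCUPATION_TYPE"].mode()[0])

print(df.isna().sum().sum())  # no missing values remain
```

Median/mode imputation is one simple choice; rows could equally be dropped when the missing share is small.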

Continuous data features

The above chart shows that some of the continuous variables are strongly correlated, e.g. a high credit amount correlates with a high goods price (correlation ≈ 0.9). Such multicollinearity is a problem in a classical statistical setting, where it makes coefficient estimates unstable; it matters less when the goal is pure predictive performance. The practical issue is that in lending you need to explain the system's behaviour, especially when it makes decisions about people. ML explainability is important so that intelligent technologies do not inherit societal biases.
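A correlation check like the one charted above can be reproduced in a few lines. The data here is synthetic, generated so that credit amount tracks goods price closely (mimicking the ≈0.9 correlation noted above); the column names are assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: credit is generated as goods price plus noise,
# so the two should come out strongly correlated.
goods_price = rng.uniform(50_000, 500_000, size=1_000)
credit = goods_price * 1.1 + rng.normal(0, 20_000, size=1_000)
income = rng.uniform(25_000, 200_000, size=1_000)

X = pd.DataFrame({
    "AMT_CREDIT": credit,
    "AMT_GOODS_PRICE": goods_price,
    "AMT_INCOME_TOTAL": income,
})

corr = X.corr()
print(corr.round(2))
```

On the real data, `X.corr()` over the continuous columns produces the matrix behind the heatmap.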

The chart above suggests there are some outliers in the data, most notably for 'total income amount'; these will be either imputed or removed. The features also identify a few key relationships:

Target variable

The target variable contains information on consumer defaults, where 1 represents a default and 0 otherwise.

The dataset is imbalanced, meaning the target class has an uneven distribution of observations, i.e. TARGET = 1 occurs less frequently than TARGET = 0. Imbalanced classification is primarily challenging due to the severely skewed class distribution. This may cause poor performance in machine learning models.
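The imbalance can be quantified directly from the target column. The sketch below uses a synthetic TARGET with a roughly 8% positive rate as a stand-in; the true default share is whatever the real data shows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic TARGET column: ~8% defaults, standing in for the real data.
target = pd.Series(rng.choice([0, 1], size=10_000, p=[0.92, 0.08]),
                   name="TARGET")

counts = target.value_counts()
share = target.mean()
print(counts)
print(f"default share: {share:.1%}")
```

On the real data, `df["TARGET"].value_counts(normalize=True)` gives the class proportions directly.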

Categorical features

There are 10 categorical features in the data; it is important to understand their relationship with the target variable:

The categorical predictive features appear relatively uncorrelated.
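One way to back up the "relatively uncorrelated" observation is Cramér's V, a chi-squared based association measure for pairs of categorical variables (0 = no association, 1 = perfect). The sketch below computes it for two independently drawn synthetic features; the feature names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

rng = np.random.default_rng(1)
# Two independently drawn categorical features -> V should be near 0.
gender = pd.Series(rng.choice(["F", "M"], size=5_000))
education = pd.Series(rng.choice(["Secondary", "Higher", "Incomplete"], size=5_000))

v = cramers_v(gender, education)
print(round(v, 3))
```

Running `cramers_v` over all pairs of the 10 categorical columns gives a categorical analogue of the correlation heatmap.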

Transform and encode data

Before applying machine learning methods, categorical variables need to be encoded into numerical form.
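One-hot encoding is a common way to do this: each category level becomes its own 0/1 column. A minimal sketch, with assumed column names:

```python
import pandas as pd

# Hypothetical slice of the categorical features; names are illustrative.
df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "M"],
    "NAME_EDUCATION_TYPE": ["Higher", "Secondary", "Secondary", "Higher"],
    "FLAG_OWN_REALTY": ["Y", "N", "Y", "Y"],
})

# One-hot encode, dropping the first level of each feature to avoid
# perfect collinearity among the resulting dummy columns.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

For tree-based models an ordinal/label encoding can work equally well; dropping the first level mainly matters for linear models such as the logistic baseline.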

Apply a Box-Cox transform to the continuous features so that the resulting variables look more normally distributed. This will help reduce skew in the raw variables.
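A sketch of the transform, using a log-normal draw to mimic a right-skewed income variable. Note Box-Cox requires strictly positive inputs; it estimates the lambda that makes the transformed variable closest to normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Incomes are typically right-skewed; a log-normal draw mimics that shape.
income = rng.lognormal(mean=11, sigma=0.6, size=5_000)

# Box-Cox: strictly positive inputs only; lambda is fitted automatically.
transformed, lam = stats.boxcox(income)

print(f"skew before: {stats.skew(income):.2f}, after: {stats.skew(transformed):.2f}")
```

scikit-learn's `PowerTransformer(method="box-cox")` does the same fit inside a pipeline, which is convenient when the transform must be learned on the training split only.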

Test and training datasets
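A sketch of the split, on synthetic stand-ins for the encoded feature matrix and target. Stratifying on the target keeps the default share the same in both splits, which matters for an imbalanced class like this one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the encoded features and TARGET vector.
X = rng.normal(size=(1_000, 5))
y = rng.choice([0, 1], size=1_000, p=[0.92, 0.08])

# Hold out 20% for testing, stratified on the (imbalanced) target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```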

Benchmark logistic model

First a logistic model is applied to the data to act as a baseline for performance evaluation. Logistic models predict the probability that an observation falls into one of two categories (TARGET = 1 or TARGET = 0) based on the set of predictors (features).
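The baseline fit and its per-class metrics can be sketched as below. The data is a synthetic imbalanced problem standing in for the credit data, so the exact numbers will differ from the report's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced two-class problem (~92% majority class).
X, y = make_classification(n_samples=5_000, n_features=8, weights=[0.92],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)

print(f"accuracy: {acc:.2%}")
# Per-class precision/recall/F1 -- far more informative than accuracy here.
print(classification_report(y_test, clf.predict(X_test)))
```

The `classification_report` is what exposes the 0% precision/recall on the minority class that headline accuracy hides.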

Note the accuracy is 91%; however, this is not a good measure given the class imbalance in the data. The recall and precision scores are 0% for TARGET = 1, i.e. the model essentially never identifies a true default, classifying almost every observation as a non-default.

Model fitting on raw data

In this section an array of classifiers is explored. Each model is fitted to the data using a 'training' sample. The 'testing' sample is used to evaluate each model's performance in predicting credit defaults. The model fits can be seen in the Annex. Once the models have been fitted using the training data, each model's performance can be evaluated when applied to unseen data (out-of-sample). The algorithms include logistic regression, K-nearest neighbours, random forest, decision trees, AdaBoost, naive Bayes and a gradient boosted classifier.
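The comparison loop can be sketched as follows, again on synthetic data, scoring each model by F1 on the minority (default) class. Hyperparameters are left at scikit-learn defaults, which may differ from the fits in the Annex.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1_000),
    "knn": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # F1 on the minority (default) class is the headline metric here.
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} F1 = {s:.3f}")
```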

The predictive performance on the imbalanced data is poor, with the naive Bayes model outperforming the others with an F1 score of just 12.5%. The accuracy statistics are misleading due to the imbalanced nature of the data.

The decision tree performs best in terms of F1 score and recall, i.e. it is best at detecting actual defaults, whilst the random forest has the best precision, i.e. the highest share of its predicted defaults that are actual defaults. That said, overall model performance is poor and defaults are not being identified particularly well.

The following confusion matrix and AUC charts show the poor predictive performance of the decision tree model.

The following helps identify which features are the most important in predicting default rates under the naive Bayes model.
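Since naive Bayes has no built-in feature importances, permutation importance is one model-agnostic way to produce such a ranking; the report's own chart may use a different method. The feature names below are assumed, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB

# Assumed feature names, standing in for the real columns.
feature_names = ["DAYS_BIRTH", "DAYS_EMPLOYED", "AMT_ANNUITY",
                 "AMT_INCOME_TOTAL", "AMT_CREDIT"]

X, y = make_classification(n_samples=2_000, n_features=5, n_informative=3,
                           random_state=0)
model = GaussianNB().fit(X, y)

# Shuffle each feature in turn and measure the drop in score:
# a bigger drop means the model relied on that feature more.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda kv: -kv[1]):
    print(f"{name:18s} {imp:.4f}")
```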

The above chart shows that age (days since birth), employment history, annuity, income and credit amount are all very important in predicting consumer default rates.

The essence of the Shapley value is to measure each feature's contribution to the default score for each observation. A positive SHAP value means the feature pushes the prediction towards default. For example, the chart below shows that at higher values of goods price, default rates are lower.
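The idea can be made concrete with an exact Shapley computation on a toy three-feature additive score. The scoring function and values below are purely illustrative, not the report's fitted model (which would use the SHAP library over many features); for an additive model each feature's Shapley value reduces exactly to its own weighted contribution.

```python
from itertools import combinations
from math import factorial

# Toy setup: three standardised feature values and a linear "default score".
features = ["goods_price", "income", "credit"]
x = {"goods_price": 1.2, "income": -0.5, "credit": 0.8}
weights = {"goods_price": -0.9, "income": -0.4, "credit": 0.7}

def score(coalition):
    # Value of a coalition: the score using only those features.
    return sum(weights[f] * x[f] for f in coalition)

def shapley(feature):
    # Average marginal contribution of `feature` over all orderings.
    others = [f for f in features if f != feature]
    n = len(features)
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += w * (score(subset + (feature,)) - score(subset))
    return total

phi = {f: shapley(f) for f in features}
print(phi)
# A high goods price with a negative weight yields a negative Shapley
# value, i.e. it pushes this observation away from default -- matching
# the pattern described above.
```

The Shapley values always sum to the full model score, which is what makes them a complete per-observation decomposition.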